A Study of Sindhi Related and Arabic Script Adapted languages Recognition

Authors

  • Dil Nawaz Hakro
  • Abdullah Zawawi Talib
  • Zeeshan Bhatti
  • G. N. Moja
Abstract

1. INTRODUCTION

Character recognition for Roman-script languages, especially English, has come close to perfection and is considered one of the successful applications in the field of computer vision. Work on Arabic script and other scripts continues, but little work has been done on the languages that have adopted Arabic script, and work on the Sindhi language is still in its earliest stages. Arabic script presents more complexities because it is an entirely different script from Roman script. Significant work has also been done on local Indian languages, but Sindhi still lacks a fully functional OCR, although remarkable work has been done on Sindhi computing (Bhatti et al., 2014). This paper presents a review of the character recognition processes and image processing techniques applied in OCR systems. The techniques include text line segmentation, word and character segmentation, and classification. The paper also looks into the choices researchers have made for their research on various languages around the world.

2. PROPERTIES OF SINDHI LANGUAGE

Moulana Ubedullah Sindhi, a well-known Islamic philosopher and scholar, wrote of the Sindhi language in his book: "The seven languages are the main languages in which Holy Books were sent and the remaining world languages are derived from these seven languages. Sindhi is one of these languages with Arabic and Hebrew" (Allana, 2004). The rich historical background of the Sindhi language can be inferred from the 5000-year-old Indus Civilization of Moen-jo-Daro near the Larkana district of Sindh (AboutIndus, 2014). In (Allana, 2004), Dr. Nabi Bux Khan Baloch, a well-known Sindhi historian and scholar, groups the views on the origin of Sindhi into several opinions, one of which holds that Sindhi is a branch of Sanskrit via Vrachada Apabhramsha.

Sindhi is spoken by 18 million people in Pakistan and 2.8 million in India. Two scripts, Arabic and Devanagari, are commonly used for writing Sindhi. Arabic is the more common of the two, extended with some modified letters. In India, Sindhi is written in both scripts. Sindhi has 52 letters, 24 more than the 28 of Arabic; some modified letters with four dots were added to accommodate the additional sounds. Sindhi has more vowels and consonants than Arabic and its neighboring language Urdu. The writing system follows the same style of …


Similar resources

Sangam: A Perso-Arabic to Indic Script Machine Transliteration Model

The Indian subcontinent is one of those unique parts of the world where single languages are written in different scripts. This is the case, for example, with Punjabi, which is written in Gurmukhi script (a left-to-right script based on Devanagari) in Indian East Punjab and in Shahmukhi (a right-to-left script based on Perso-Arabic) in Pakistani West Punjab. This is also the case with other lang...


Off-line Arabic Handwritten Recognition Using a Novel Hybrid HMM-DNN Model

To facilitate the entry of data into the computer and its digitization, automatic recognition of printed texts and manuscripts is a considerable aid to many applications. Research on automatic document recognition started decades ago with the recognition of isolated digits and letters, and today, due to advancements in machine learning methods, efforts are being made to iden...


Instant Diacritics Restoration System for Sindhi Accent Prediction using N-Gram and Memory-Based Learning Approaches

The script of the Sindhi language is highly complex for many reasons, including an abundance of homographic words. Interpreting the text becomes difficult because a homographic word can carry multiple meanings unless its specific pronunciation is given with the help of diacritics. Diacritics help readers comprehend the text easily. Due to the rapidly developi...


Lexicon Reduction for Urdu/Arabic Script Based Character Recognition: A Multilingual OCR

Arabic script character recognition is a challenging task due to the complexity of the script and the huge number of ligatures. We present a method for the development of multilingual Arabic script OCR (Optical Character Recognition) and lexicon reduction for Arabic script and its derivative languages. The objective of the proposed method is to overcome the large dataset Urdu and similar scripts by using...


7-bit Meta-Transliterations for 8-bit Romanizations

[7-bit encoding, transliteration] We propose a general strategy for deriving 7-bit encodings for texts in languages which use an alphabetic non-Roman script, like Arabic, Persian, Sanskrit and many other Indic scripts, and for which there is some transliteration convention using Roman letters with additional diacritical marks. These schemes, which we will call "meta-transliterations", are based...



Journal title:
  • CoRR

Volume abs/1412.4217  Issue 

Pages  -

Publication date 2014